library(readr)
library(tidyverse)
library(dplyr)

For this project, we used the FoodData Central API provided by the United States Department of Agriculture (USDA) to access detailed nutritional data.
The API is designed to help developers incorporate nutrient information into their applications, and it comes with comprehensive documentation of the database structure and data elements. Using this API, we collected data on 10,000 branded food items, capturing 24 features such as brand name, food category, package weight, ingredients, and various nutrients and vitamins. The data were exported to a CSV file and imported into RStudio for analysis.
While the dataset is rich in information, we encountered missing values in some fields, particularly for ingredients, as not all foods have complete nutritional details. Despite this, the data offer valuable insights into nutrient composition, enabling us to identify and address gaps in dietary information. The data source is well-documented and regularly updated by the USDA, ensuring reliability and accessibility for research purposes.
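The collection script itself is not shown in this report. As a rough illustration of the step described above, the following is a minimal sketch, assuming the public FoodData Central search endpoint and its usual response fields (`foods`, `foodNutrients`, `nutrientName`, `unitName`, `value`). The function names, the `FDC_API_KEY` environment variable, the search term, and the output file name are all hypothetical, and the example record is made up.

```r
# Sketch only: the endpoint parameters and field names below are assumptions
# about the public FoodData Central search API, not the original script.

# Fetch one page of branded foods as a list of nested records.
fetch_branded_page <- function(api_key, query, page = 1, page_size = 200) {
  resp <- httr::GET(
    "https://api.nal.usda.gov/fdc/v1/foods/search",
    query = list(api_key = api_key, query = query, dataType = "Branded",
                 pageSize = page_size, pageNumber = page)
  )
  httr::stop_for_status(resp)
  parsed <- jsonlite::fromJSON(
    httr::content(resp, as = "text", encoding = "UTF-8"),
    simplifyVector = FALSE
  )
  parsed$foods
}

# Flatten one food record (a nested list) into a one-row data frame.
flatten_food <- function(food) {
  # A scalar field that is absent from the record becomes NA.
  get_field <- function(name) {
    value <- food[[name]]
    if (is.null(value)) NA else value
  }
  row <- data.frame(
    fdcId = get_field("fdcId"),
    description = get_field("description"),
    brandOwner = get_field("brandOwner"),
    stringsAsFactors = FALSE,
    check.names = FALSE
  )
  # One column per nutrient, named "<nutrientName> <unitName>", e.g. "Protein G".
  for (nutrient in food$foodNutrients) {
    col <- paste(nutrient$nutrientName, nutrient$unitName)
    row[[col]] <- nutrient$value
  }
  row
}

# Made-up record, used only to show the flattening without a network call.
example_food <- list(
  fdcId = 1L,
  description = "Example granola bar",
  foodNutrients = list(
    list(nutrientName = "Protein", unitName = "G", value = 3.5),
    list(nutrientName = "Energy", unitName = "KCAL", value = 120)
  )
)
example_row <- flatten_food(example_food)

# The real request runs only when an API key is available.
api_key <- Sys.getenv("FDC_API_KEY")  # hypothetical variable name
if (nzchar(api_key)) {
  foods <- fetch_branded_page(api_key, query = "cereal")  # hypothetical search term
  branded <- dplyr::bind_rows(lapply(foods, flatten_food))
  readr::write_csv(branded, "branded_sample.csv")  # hypothetical file name
}
```

Nutrients that a product does not report simply produce no column for that record, so they appear as missing values once the rows are combined.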
data <- read_csv("Branded Nutrients_Dataset.csv",
                 show_col_types = FALSE)

The dataset has 10,000 rows, each corresponding to one product, and 24 columns, listed below.
“fdcId”
“description”
“foodCategory”
“brandOwner”
“brandName”
“packageWeight”
“publishedDate”
“Protein G”
“Total lipid (fat) G”
“Carbohydrate, by difference G”
“Energy KCAL”
“Total Sugars G”
“Fiber, total dietary G”
“Calcium, Ca MG”
“Iron, Fe MG”
“Sodium, Na MG”
“Vitamin C, total ascorbic acid MG”
“Cholesterol MG”
“Fatty acids, total trans G”
“Fatty acids, total saturated G”
“Vitamin A UG”
“Potassium, K MG”
“Vitamin D (D2 + D3) UG”
“Sugars, added G”
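Many of these column names contain spaces, commas, or parentheses, so they are not syntactic R names. The sketch below shows two ways to refer to such columns. It uses a made-up toy data frame that borrows three of the names above; the values are invented for illustration only.

```r
# Made-up toy rows that reuse three of the column names listed above.
# check.names = FALSE keeps the names exactly as written.
toy <- data.frame(
  fdcId = c(1L, 2L),
  "Protein G" = c(3.5, NA),
  "Energy KCAL" = c(120, 250),
  check.names = FALSE
)

# With `$` (and inside dplyr verbs or formulas) the name needs backticks.
protein <- toy$`Protein G`

# `[[` takes the name as an ordinary string.
energy <- toy[["Energy KCAL"]]

# Summaries must skip missing values explicitly.
mean_protein <- mean(protein, na.rm = TRUE)
```

The same backtick syntax applies to the full dataset, for example when referring to `Total lipid (fat) G` inside `select()` or `filter()`.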
Except for fdcId and description, every column has missing values, to varying degrees. We therefore examine the percentage of missing values in each column.
# Count and percentage of missing values per column, sorted from most to least missing
missing_summary <- data.frame(
  Column = colnames(data),
  MissingCount = colSums(is.na(data)),
  MissingPercent = colSums(is.na(data)) / nrow(data) * 100
) |>
  arrange(desc(MissingPercent))
# Bar chart of the missing percentage per column, highest first
ggplot(missing_summary, aes(x = reorder(Column, -MissingPercent), y = MissingPercent)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(0, 100, by = 10), limits = c(0, 100)) + # y-axis from 0 to 100 in steps of 10
  labs(
    title = "Percentage of Missing Values per Nutrient",
    x = "Nutrients",
    y = "Percentage of Missing Values"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The plot shows that the main nutrients, such as Protein G, Energy KCAL, and Total lipid (fat) G, each have less than 1% missing values. The brandOwner column is also nearly complete, with 2.27% missing values.
Nutrients that are not always displayed on packages, including the vitamins and Sugars, added G, have high proportions of missing values. In this project we therefore focus mainly on the main nutrients and address the supplementary nutrients only when necessary.
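One way to make this choice explicit is to keep only the columns whose missing percentage stays under a cutoff. The following is a minimal sketch, not the procedure used in this project: the helper names and the 10% threshold are assumptions, and the toy data frame is made up.

```r
# Percentage of missing values per column.
missing_percent <- function(df) {
  colMeans(is.na(df)) * 100
}

# Keep only the columns whose missing percentage is at most `max_missing`.
keep_complete_columns <- function(df, max_missing = 10) {
  df[, missing_percent(df) <= max_missing, drop = FALSE]
}

# Made-up toy data: the vitamin column is missing in 3 of 4 rows (75%).
toy <- data.frame(
  fdcId = 1:4,
  "Protein G" = c(3.5, 8, 1.2, 0),
  "Vitamin D (D2 + D3) UG" = c(NA, NA, NA, 1.1),
  check.names = FALSE
)

main_only <- keep_complete_columns(toy)
names(main_only)
```

Here `colMeans(is.na(df)) * 100` gives the same percentages as `colSums(is.na(data)) / nrow(data) * 100` in the summary table. Applied to the full dataset, a cutoff like this would keep the main nutrients and drop sparsely reported ones, which can still be analyzed separately when needed.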